DRAM/NVM hierarchical heterogeneous memory access method and system with software-hardware cooperative management

ABSTRACT

The present invention provides a DRAM/NVM hierarchical heterogeneous memory system with software-hardware cooperative management schemes. In the system, NVM is used as large-capacity main memory, and DRAM is used as a cache to the NVM. Some reserved bits in the data structures of the TLB and the last-level page table are employed to eliminate the hardware costs of the conventional hardware-managed hierarchical memory architecture. The cache management in such a heterogeneous memory system is pushed to the software level. Moreover, the invention is able to reduce memory access latency in case of last-level cache misses. Considering that many applications have relatively poor data locality in big data application environments, the conventional demand-based data fetching policy for the DRAM cache can aggravate cache pollution. In the present invention, a utility-based data fetching mechanism is adopted in the DRAM/NVM hierarchical memory system, and it determines whether data in the NVM should be cached in the DRAM according to the current DRAM memory utilization and application memory access patterns. This improves the efficiency of the DRAM cache and the bandwidth usage between the NVM main memory and the DRAM cache.

TECHNICAL FIELD

The present invention belongs to the field of cache performance optimization in a DRAM/NVM heterogeneous memory environment. In particular, a DRAM/NVM hierarchical heterogeneous memory access method and system with software-hardware cooperative management schemes are designed, and a utility-based data fetching mechanism is proposed in this system.

BACKGROUND ART

With the development of multi-core and multi-threading technology, Dynamic Random Access Memory (DRAM) can no longer meet the growing memory demand of applications due to restrictions in terms of power consumption and techniques. Emerging Non-Volatile Memories (NVMs), such as Phase Change Memory (PCM), Spin Transfer Torque Magnetoresistive Random Access Memory (STT-MRAM), and Magnetic Random Access Memory (MRAM), have features such as byte addressability, read speed comparable to that of DRAM, near-zero standby power consumption, high density (storing more data per chip), and high scalability, and may serve as a substitute for DRAM as the storage medium of main memory. However, compared with DRAM, these new non-volatile memories still have several disadvantages: (1) a relatively high read/write delay, where the read speed is approximately two times slower than that of DRAM, and the write speed is almost five times slower than that of DRAM; (2) high write power consumption; and (3) a limited endurance life. Therefore, it is infeasible to directly use these emerging non-volatile memories as the computer main memory. A mainstream approach at present is to integrate a large amount of non-volatile memory with a small amount of DRAM to form a heterogeneous memory system, so that the performance, power efficiency and endurance of the memory system are improved by exploiting the large capacity of the non-volatile memory together with the low memory access delay, low write power consumption, and high endurance of the DRAM.

There are mainly two types of heterogeneous memory architectures at present: flat and hierarchical heterogeneous memory architectures.

In a heterogeneous memory system with a flat architecture, the NVM and the DRAM are uniformly addressable, and both NVM and DRAM are used as main memory. To improve the power efficiency and performance of the system, hot page migration is a common optimization policy adopted in this architecture. That is, frequently accessed NVM page frames are migrated to the DRAM. A migration operation is generally divided into two sequential steps: (1) copying the content of the source and target page frames into a buffer; and (2) writing the data in the buffer into the target addresses. Therefore, one page migration operation may generate four page copies (two into the buffer and two out of it), and the time cost of the migration operation is relatively large because the reading phase and the writing phase are performed sequentially. Besides, if a memory system supports 2 MB or 4 MB superpages to reduce the TLB miss rate, the hot page migration mechanism can lead to tremendous time and space overhead.

In a heterogeneous memory system with a hierarchical architecture, high-performance memories such as DRAM are used as a cache to the non-volatile memory. As memory access to the DRAM cache is more efficient than access to the NVM, a heterogeneous memory system with a hierarchical architecture can achieve better application performance compared with a heterogeneous memory system with a flat architecture. In a conventional hierarchical heterogeneous memory system, the DRAM cache is managed by hardware and is transparent to operating systems, and the organization of the DRAM cache is similar to a conventional on-chip cache. When a last-level cache (LLC) miss occurs, the hardware circuit in the DRAM memory controller is responsible for the tag lookup of a physical page address. It determines whether the data access hits in the DRAM cache, and then performs the actual data access. This implies that the hierarchical heterogeneous memory system has a relatively long access delay when a DRAM cache miss occurs. In addition, the hardware-managed DRAM cache generally adopts a demand-based data fetching mechanism. When data misses in the DRAM cache, the NVM data block corresponding to the missing data must first be fetched into the DRAM cache and is then loaded into the on-chip last-level cache. In big data environments, many applications have poor temporal/spatial locality, and such a data fetching mechanism would aggravate cache pollution.

SUMMARY

In view of the disadvantages of the prior techniques, the present invention provides a DRAM/NVM hierarchical heterogeneous memory system with software-hardware cooperative management, which is aimed at eliminating the large hardware costs and reducing the memory access delay of a conventional hierarchical heterogeneous memory system. Moreover, the DRAM cache can be managed in a more flexible manner by software, and thus the DRAM cache utilization and the memory access throughput between NVM and DRAM are improved. On the hardware side, the data layout of the Translation Lookaside Buffer (TLB) is modified in the present invention. On the software side, the data structure of the current page table is extended to support a mapping from NVM pages to DRAM pages. Furthermore, a utility-based data fetching policy is designed to improve the utilization of the DRAM cache and the bandwidth usage between the NVM main memory and the DRAM cache.

To achieve the aforementioned objectives, a DRAM/NVM hierarchical heterogeneous memory access method with software-hardware cooperative management is provided, including:

step 1: address translation in TLB: acquiring the physical page number (ppn), the P flag, and the content of the overlap tlb field from a TLB entry which corresponds to a virtual page number (vpn), and translating the virtual address into an NVM physical address according to the ppn;

step 2: determining whether the memory access hits in the on-chip cache; directly fetching, by a CPU, the requested data block from the on-chip cache if the memory access hits, and ending the memory access process; otherwise, turning to step 3;

step 3: determining the memory access type according to the P flag acquired in step 1; if P is 0, which indicates access to the NVM memory, turning to step 4, updating the overlap tlb field (used as a counter in this case) in the TLB entry, and determining, according to the fetching threshold set by the dynamic threshold adjustment algorithm and the overlap tlb field acquired in step 1, whether the NVM physical page corresponding to the virtual page should be fetched into the DRAM cache; otherwise, if P is 1, which indicates access to the DRAM cache and a cache hit, calculating the physical address of the to-be-accessed DRAM cache page according to the overlap tlb field acquired in step 1 and the offset of the virtual address, and turning to step 6 to access the DRAM cache;

step 4: if the value of the overlap tlb field acquired in step 1 is less than the fetching threshold, looking up the TLB entry corresponding to the NVM main memory page, increasing the value of its overlap tlb field (used as a counter in this case) by one, and turning to step 6 to directly access the NVM memory, where the fetching threshold is determined by a fetching threshold runtime adjustment algorithm; if the value of the overlap tlb field acquired in step 1 is greater than the fetching threshold, turning to step 5 to fetch the NVM main memory page into the DRAM;

step 5: fetching the NVM page corresponding to the virtual address into the DRAM cache, and updating the TLB and the extended page table; and

step 6: memory access: accessing the memory according to the physical address delivered to the memory controller.

According to another embodiment of the present invention, a DRAM/NVM hierarchical heterogeneous memory system with software-hardware cooperative management is provided, including a modified TLB layout, an extended page table, and a utility-based data fetching module, where:

the modified TLB is used to cache both virtual-to-NVM and virtual-to-DRAM address mappings, which improves the efficiency of address translation. In addition, some reserved bits in the TLB are used to record the page access frequency of applications, so as to assist the utility-based data fetching module in decision making;

the extended page table stores all mappings from virtual pages to physical pages and mappings from NVM pages to DRAM cache pages, where a PD flag is used to indicate the type of the page frame recorded in a page table entry; and

the utility-based data fetching module is used to replace the demand-based data fetching policy in a conventional hardware-managed cache. The module mainly includes three sub-modules: a memory monitoring module, a fetching threshold runtime adjustment module, and a data prefetcher: (1) the memory monitoring module is used to acquire the DRAM cache utilization and the NVM page access frequency from the modified TLB and the memory controller, and use them as input information for the fetching threshold runtime adjustment module; (2) the fetching threshold runtime adjustment module is used to dynamically adjust the data fetching threshold according to the runtime information provided by the memory monitoring module, to improve the DRAM cache utilization and the bandwidth usage between the NVM and the DRAM; (3) the data prefetcher is used to: (a) trigger a buddy allocator of the DRAM cache management module to allocate a DRAM page for caching an NVM memory page; (b) copy the content of the NVM memory page to the DRAM page allocated by the buddy allocator; and (c) update the extended page table and the TLB.

In general, compared with the previous techniques, the present invention has the following advantages:

(1) The system modifies the TLB data structure and extends the page table, which enables uniform management of DRAM cache pages and NVM main memory pages, and eliminates hardware costs in a conventional DRAM/NVM hierarchical memory system while reducing memory access delay.

(2) The system monitors memory accesses using some TLB reserved bits, and eliminates huge hardware costs compared with the conventional DRAM/NVM heterogeneous memory system that monitors the page access frequency in the memory controller.

(3) The system develops a utility-based cache fetching method, and a dynamic fetching threshold adjustment algorithm is designed to dynamically adjust the fetching threshold according to application memory access locality and utilization of the DRAM cache, which improves the utilization of the DRAM cache and the bandwidth usage between the NVM main memory and the DRAM cache.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a structural diagram of a DRAM-NVM hierarchical heterogeneous memory system with software-hardware cooperative management according to the present invention;

FIG. 2 shows last-level page table entry structures before and after extension according to the present invention;

FIG. 3 shows TLB entry structures before and after modification according to the present invention;

FIG. 4 is a working flow chart of a DRAM-NVM heterogeneous memory system according to the present invention;

FIG. 5 is a flow chart of translation from a virtual address to a physical address in a DRAM-NVM hierarchical heterogeneous memory system according to the present invention;

FIG. 6 is a flow chart of TLB lookup and update in a DRAM-NVM hierarchical heterogeneous memory system according to the present invention;

FIG. 7 is a flow chart of page table query in a DRAM-NVM hierarchical heterogeneous memory system according to the present invention; and

FIG. 8 is a flow chart of prefetching an NVM page to a DRAM cache in a DRAM-NVM hierarchical heterogeneous memory system according to the present invention.

DETAILED DESCRIPTION

To illustrate the objectives, technical solutions and advantages of the present invention more clearly, the following further describes the details of this invention with figures and case studies. It should be noted that the specific cases described in this invention are only used to illustrate the present invention rather than to limit its application scenarios.

FIG. 1 shows the system architecture of a DRAM/NVM hierarchical heterogeneous memory system with software-hardware cooperative management. The hardware layer includes a modified TLB, and the software layer includes an extended page table, a utility-based data fetching module, and a DRAM cache management module. The extended page table is mainly used to manage the mapping from virtual pages to physical pages and the mapping from NVM memory pages to DRAM cache pages. The modified TLB caches page table entries that are frequently accessed in the extended page table, thereby improving the efficiency of address translation. In addition, some reserved bits in the modified TLB are used to record the access frequency of NVM memory pages, and to assist the utility-based data fetching module in data fetching operations. The utility-based data fetching module is mainly used to cache NVM pages in the DRAM, and it includes three sub-modules: a memory monitoring module, a fetching threshold runtime adjustment module, and a data fetcher. The memory monitoring module is used to collect page access frequency information and cache utilization from the TLB and the memory controller, respectively; the fetching threshold runtime adjustment module dynamically adjusts the page fetching threshold according to the information collected by the memory monitoring module. When the access frequency of an NVM memory page is greater than the threshold, the data fetcher is invoked to cache the NVM page into the DRAM. The DRAM cache management module is mainly used to manage the DRAM cache, and mainly includes a buddy allocator sub-module. When an NVM page needs to be cached into the DRAM, the buddy allocator allocates a DRAM page to load the content of the NVM page according to a buddy allocation algorithm, and meanwhile updates the extended page table to set up a mapping from the NVM page to the DRAM page, so that the DRAM page can be written back to the corresponding NVM page when the DRAM page is reclaimed.
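The text names a buddy allocation algorithm but does not describe its implementation. The following is a minimal Python sketch of a generic buddy allocator over page numbers, given only to illustrate the idea; the class name `BuddyAllocator`, its methods, and the pool size are hypothetical and are not taken from the invention.

```python
class BuddyAllocator:
    """Generic buddy allocator over a pool of 2**max_order pages (illustrative)."""

    def __init__(self, max_order):
        self.max_order = max_order
        # free[o] holds the start page of every free block of 2**o pages
        self.free = {order: set() for order in range(max_order + 1)}
        self.free[max_order].add(0)      # initially one block covers the pool
        self.allocated = {}              # start page -> order of the block

    def alloc(self, order=0):
        """Return the start page of a free block of 2**order pages, or None."""
        for o in range(order, self.max_order + 1):
            if self.free[o]:
                start = min(self.free[o])
                self.free[o].remove(start)
                break
        else:
            return None                  # no sufficiently large block is free
        # Split the block down to the requested size; the upper half of each
        # split goes back onto the free list of the smaller order.
        while o > order:
            o -= 1
            self.free[o].add(start + (1 << o))
        self.allocated[start] = order
        return start

    def free_block(self, start):
        """Release a block and merge it with its buddy while the buddy is free."""
        order = self.allocated.pop(start)
        while order < self.max_order:
            buddy = start ^ (1 << order)
            if buddy not in self.free[order]:
                break
            self.free[order].remove(buddy)
            start = min(start, buddy)
            order += 1
        self.free[order].add(start)
```

In this sketch a single DRAM cache page corresponds to `alloc(0)`; larger orders are shown only because splitting and merging are what distinguish a buddy allocator from a plain free list.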

To push the management of the DRAM cache to the software level, in the present invention, the DRAM cache and the NVM are uniformly managed in a flat physical address space, and the data structure of the last-level page table is extended, as shown in FIG. 2. A conventional page table entry generally includes flags such as P, R/W, U/S, D, and AVAIL, where the P flag indicates whether the page recorded in the page table entry has been fetched into the memory; the R/W flag records the read/write permission on the page recorded in the page table entry, where R represents that read is allowed, and W represents that write is allowed; the U/S flag indicates whether the page recorded in the page table entry is a user mode page frame or a kernel mode page frame; and the D flag records whether any data has been written into the page corresponding to the entry. On the basis of a conventional page table, a PD flag bit and an overlap field are added to the last-level page table entry after extension. The PD flag (1 bit) indicates the medium type of the physical page frame corresponding to a virtual page frame. When PD is 0, the physical page frame corresponding to the virtual page frame resides in the NVM main memory; when PD is 1, the physical page frame corresponding to the virtual page frame has been fetched from the NVM main memory into the DRAM cache. The overlap field records different content depending on the PD flag: when PD is 0, the field is used as a counter and records the number of memory accesses to the page referenced by the page table entry; when PD is 1, the field records the physical address of the DRAM page corresponding to the NVM page, thereby establishing a mapping from the NVM page to the DRAM page.
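As a rough illustration of the dual use of the overlap field, the following Python sketch models an extended last-level entry as an integer. The bit positions and the 20-bit field width (`PD_BIT`, `OVERLAP_SHIFT`, `OVERLAP_BITS`) are assumptions made for the example, as is the saturating counter; the text does not state which reserved bits are used or how overflow is handled.

```python
# Hypothetical layout for illustration only; the source gives no bit positions.
PD_BIT = 1 << 0             # assumed position of the PD flag
OVERLAP_SHIFT = 1           # assumed start of the overlap field
OVERLAP_BITS = 20           # assumed width of the overlap field
OVERLAP_MAX = (1 << OVERLAP_BITS) - 1
OVERLAP_MASK = OVERLAP_MAX << OVERLAP_SHIFT


def get_pd(pte):
    """Return 1 if the page is cached in DRAM, 0 if it resides only in NVM."""
    return 1 if pte & PD_BIT else 0


def get_overlap(pte):
    """Return the overlap field: access counter (PD=0) or DRAM page (PD=1)."""
    return (pte & OVERLAP_MASK) >> OVERLAP_SHIFT


def set_overlap(pte, value):
    """Return a copy of pte with the overlap field replaced by value."""
    if not 0 <= value <= OVERLAP_MAX:
        raise ValueError("value does not fit in the overlap field")
    return (pte & ~OVERLAP_MASK) | (value << OVERLAP_SHIFT)


def count_access(pte):
    """PD=0: bump the access counter (saturating, which is an assumption)."""
    if get_pd(pte) != 0:
        raise ValueError("counter is only meaningful when PD is 0")
    return set_overlap(pte, min(get_overlap(pte) + 1, OVERLAP_MAX))


def map_to_dram(pte, dram_page):
    """PD=1: record the DRAM page that now caches this NVM page."""
    return set_overlap(pte, dram_page) | PD_BIT
```

For example, two calls to `count_access` on an empty entry leave PD at 0 with a counter of 2, and a subsequent `map_to_dram(pte, 0x1A2)` overwrites the counter with the DRAM page and sets PD to 1.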

The TLB, as one of the most important hardware components in a computer system, eliminates most page table walk overheads by caching the recently accessed page table entries, and thus significantly improves the efficiency of translation from a virtual address to a physical address. In this invention, some reserved bits of the TLB entry are used to extend the data structure of a TLB entry, so that a TLB entry can cache an extended page table entry, as shown in FIG. 3. On the basis of the current data structure of the TLB entry, a P flag bit and an overlap tlb field are added in the extended TLB entry. Similar to the PD flag in the page table entry, the P flag is used to indicate the medium type of the physical page frame corresponding to the virtual page frame recorded in the TLB entry. When P is 0, the TLB entry records the mapping from a virtual page to an NVM physical page, and the overlap tlb field records the number of memory accesses to the page referenced by the TLB entry. When P is 1, the physical page recorded in the TLB entry has been fetched from the NVM main memory into the DRAM cache, and the overlap tlb field records the DRAM page number corresponding to the NVM page.
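The effect of the P flag on address formation can be sketched as follows. This is an illustrative Python model, not the hardware layout: the `TlbEntry` class, the `target_address` helper, and the 4 KiB page size (`PAGE_SHIFT = 12`) are assumptions introduced for the example.

```python
from dataclasses import dataclass

PAGE_SHIFT = 12  # assumed 4 KiB pages; the source does not fix a page size


@dataclass
class TlbEntry:
    vpn: int              # virtual page number
    ppn: int              # NVM physical page number
    p: int = 0            # 0: page is in NVM, 1: page is cached in DRAM
    overlap_tlb: int = 0  # access counter if p == 0, DRAM page number if p == 1


def target_address(entry, vaddr):
    """Return (medium, physical address) for an access through this entry."""
    offset = vaddr & ((1 << PAGE_SHIFT) - 1)
    if entry.p == 1:
        # DRAM cache hit: the page number comes from the overlap tlb field
        return "DRAM", (entry.overlap_tlb << PAGE_SHIFT) | offset
    # NVM access: the page number is the ordinary ppn
    return "NVM", (entry.ppn << PAGE_SHIFT) | offset
```

With `TlbEntry(vpn=0x5, ppn=0x40)` and virtual address `0x5123`, the helper yields an NVM address of `0x40123`; after setting `p = 1` and `overlap_tlb = 0x7`, the same access resolves to DRAM address `0x7123`.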

FIG. 4 is a working flow chart of a DRAM-NVM hierarchical heterogeneous memory system with software-hardware cooperative management according to the present invention, where the specific steps are as follows:

Step 1: TLB address translation: acquire the physical page number ppn, the P flag, and the content of the overlap tlb field from the TLB entry which corresponds to a virtual page, and translate the virtual address into an NVM physical address according to the ppn. As shown in FIG. 5, step 1 specifically includes the following sub-steps:

(1-1) TLB lookup: look up the TLB entry by using the virtual page number as the key, to acquire the physical page number ppn, the P flag bit and the content of the overlap tlb field corresponding to the virtual page number, as shown in FIG. 6(a).

(1-2) If an entry corresponding to the virtual address is found in the TLB, that is, the TLB is hit, acquire the physical page number corresponding to the virtual page, translate the virtual address into a physical address, and go to step 2 to access the on-chip cache by using the physical address; otherwise, the TLB is missed and a page table walk is needed. This process is similar to a page table walk in conventional architectures, as shown in FIG. 7. It includes the following detailed steps:

(1-2-1) Acquire the base address of the level-1 page directory from the CR3 register; acquire the index a into the level-1 page directory from the virtual address, and add the index a to the base address of the level-1 page directory to obtain the address of the entry that holds the base address of the level-2 page directory.

(1-2-2) Access the entry in the level-1 page directory at the address obtained in (1-2-1) and acquire the base address of the level-2 page directory, addr[a]. Add index b of the level-2 page directory to addr[a] to obtain the address of the entry that holds the base address of the level-3 page directory.

(1-2-3) Repeat this process until the last-level page table is accessed. If the entry corresponding to the virtual address is valid, acquire the corresponding physical page number addr[c], the flag PDc, which indicates the storage medium type of the page frame (PDc is 1 if the page frame is in DRAM and 0 if it is in NVM), and the content of the overlap field; otherwise, the accessed entry is invalid: load the missing page into the NVM main memory from the external storage, update the page table, and then go to (1-2-4) to update the TLB.

(1-2-4) TLB update: insert a TLB entry which sets up a mapping from the virtual address to the physical address (virtual page number vpn, physical page number addr[c], PDc, and overlap field c) into the TLB, as shown in FIG. 6(b). The TLB update mainly includes the following two sub-steps:

(1-2-4-1) If PDc is 0, which indicates that the physical page corresponding to the virtual page is an NVM page, set the P flag of the TLB entry to 0; otherwise, the physical page corresponding to the virtual page is a DRAM page, and the P flag of the TLB entry is set to 1.

(1-2-4-2) If the TLB has free space, create a new entry and insert (virtual page number vpn, physical page number addr[c], PDc, overlap field) into the new TLB entry, where the overlap field corresponds to the overlap tlb field in the TLB; if the TLB is full, reclaim a TLB entry using the LRU replacement algorithm, and then install the new TLB entry.
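Sub-steps (1-1) through (1-2-4-2) can be sketched in software as follows. This is a simplified Python model under stated assumptions: page tables are nested dictionaries rather than memory-resident arrays, the four levels and 9-bit indices (`LEVELS`, `INDEX_BITS`) are illustrative, and the names `Tlb`, `walk`, `map_page`, and `translate` are hypothetical. The page-fault path of (1-2-3) is reduced to raising an exception.

```python
from collections import OrderedDict

LEVELS = 4       # assumed number of page-table levels
INDEX_BITS = 9   # assumed index width per level


def split_vpn(vpn):
    """Split a virtual page number into per-level indices, top level first."""
    mask = (1 << INDEX_BITS) - 1
    return [(vpn >> (INDEX_BITS * i)) & mask for i in reversed(range(LEVELS))]


def map_page(root, vpn, ppn, pd=0, overlap=0):
    """Install a last-level entry (ppn, PD, overlap) for vpn."""
    idxs = split_vpn(vpn)
    node = root
    for idx in idxs[:-1]:
        node = node.setdefault(idx, {})
    node[idxs[-1]] = (ppn, pd, overlap)


def walk(root, vpn):
    """Follow the indices level by level; return the leaf tuple or None."""
    node = root
    for idx in split_vpn(vpn):
        node = node.get(idx)
        if node is None:
            return None      # invalid entry: the page-fault path would run here
    return node


class Tlb:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # vpn -> [ppn, P, overlap tlb], LRU first

    def lookup(self, vpn):
        entry = self.entries.get(vpn)
        if entry is not None:
            self.entries.move_to_end(vpn)      # mark as most recently used
        return entry

    def insert(self, vpn, ppn, pd, overlap):
        if vpn not in self.entries and len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)   # reclaim the LRU entry
        self.entries[vpn] = [ppn, pd, overlap]  # the P flag mirrors PD
        self.entries.move_to_end(vpn)


def translate(tlb, root, vpn):
    """TLB lookup with a page table walk and TLB update on a miss."""
    entry = tlb.lookup(vpn)
    if entry is None:
        leaf = walk(root, vpn)
        if leaf is None:
            raise LookupError("page fault: page must be loaded into NVM first")
        tlb.insert(vpn, *leaf)
        entry = tlb.lookup(vpn)
    return entry
```

The sketch does not model what happens to the counter held in a reclaimed TLB entry, since the text does not describe it.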

Step 2: Determine whether the memory access hits in the on-chip cache; if it hits, the CPU directly fetches the requested data block from the on-chip cache and the memory access process ends; otherwise, go to step 3.

Step 3: Determine the storage medium of the memory access according to the P flag acquired in step 1. If P is 0, which indicates an NVM page access, go to step 4, update the overlap tlb field (used as a counter in this case) in the TLB entry, and determine whether the NVM physical page corresponding to the virtual page should be fetched into the DRAM cache according to the fetching threshold set by the dynamic threshold adjustment algorithm and the overlap tlb field acquired in step 1. Otherwise, if P is 1, which indicates a DRAM page access and a cache hit, obtain the physical address of the to-be-accessed DRAM cache page according to the overlap tlb field acquired in step 1 and the offset of the virtual address, and go to step 6 to access the DRAM cache.

Step 4: If the value of the overlap tlb field acquired in step 1 is less than the fetching threshold, look up the TLB entry corresponding to the NVM main memory page, increase the value of its overlap tlb field (used as a counter in this case) by one, and go to step 6 to directly access the NVM memory, where the fetching threshold is determined by a fetching threshold runtime adjustment algorithm; if the value of the overlap tlb field acquired in step 1 is greater than the fetching threshold, go to step 5 to fetch the NVM main memory page into the DRAM. The fetching threshold runtime adjustment algorithm includes the following sub-steps:

(4-1) Acquire, by the memory monitoring module, the number of NVM page fetches n_{fetch}, the number of cache reads n_{dram_read}, and the number of cache writes n_{dram_write} from the memory controller. Assume that the average read and write latencies of the NVM are t_{nvm_read} and t_{nvm_write}, respectively, that the average read and write latencies of the DRAM are t_{dram_read} and t_{dram_write}, respectively, and that the overhead of caching an NVM page into the DRAM is t_{fetch}. Every 10⁹ clocks, calculate by using Formula 4.1 the performance gain obtained from caching NVM pages in the DRAM during that period:

benefit_t = n_{dram_read} × (t_{nvm_read} − t_{dram_read}) + n_{dram_write} × (t_{nvm_write} − t_{dram_write}) − n_{fetch} × t_{fetch}   (Formula 4.1)

(4-2) Assume that the initial fetching threshold is fetch_thres_0 (fetch_thres_0 ≥ 0), that the fetching threshold and performance gain in the previous 10⁹-clock period are fetch_thres_{t−1} and benefit_{t−1}, respectively, that the fetching threshold and the performance gain of the current 10⁹-clock period are fetch_thres_t and benefit_t, respectively, and that the cache utilization is dram_usage. If dram_usage > 30%, adjust the fetching threshold by using a hill climbing algorithm, which mainly includes the following sub-steps:

(4-2-1) If this is the first threshold adjustment, calculate the performance gain under the given fetching threshold. If benefit_t ≥ 0, data block fetching can improve system performance, and if fetch_thres_0 > 0, set fetch_thres_t = fetch_thres_0 − 1; otherwise (benefit_t < 0), data block fetching may decrease system performance, so set fetch_thres_t = fetch_thres_0 + 1. If this is not the first threshold adjustment, go to the next step.

(4-2-2) If benefit_t > benefit_{t−1}, the threshold adjustment made in the previous 10⁹-clock period improved system performance, so the same adjustment is applied again. That is, if the fetching threshold was decreased in the previous 10⁹-clock period and fetch_thres_{t−1} > 0, set fetch_thres_t = fetch_thres_{t−1} − 1; and if the fetching threshold was increased in the previous 10⁹-clock period, set fetch_thres_t = fetch_thres_{t−1} + 1. Otherwise (benefit_t < benefit_{t−1}), apply the adjustment opposite to that of the previous 10⁹-clock period.

(4-2-3) Update benefit_{t−1} to benefit_t.
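Formula 4.1 and the hill climbing adjustment of sub-steps (4-2-1) to (4-2-3) can be sketched in Python as follows. The function and class names are hypothetical. Two behaviours are assumptions where the text is silent: a period with equal gain (benefit_t equal to benefit_{t−1}) is treated like a worse one, and a decrease requested at a threshold of 0 leaves the threshold unchanged.

```python
def benefit(n_dram_read, n_dram_write, n_fetch,
            t_nvm_read, t_dram_read, t_nvm_write, t_dram_write, t_fetch):
    """Formula 4.1: latency saved by DRAM hits minus the cost of page fetches."""
    return (n_dram_read * (t_nvm_read - t_dram_read)
            + n_dram_write * (t_nvm_write - t_dram_write)
            - n_fetch * t_fetch)


class ThresholdTuner:
    """Hill climbing over the fetching threshold, one step per period."""

    USAGE_GATE = 0.30   # adjustment runs only when dram_usage > 30%

    def __init__(self, fetch_thres_0):
        if fetch_thres_0 < 0:
            raise ValueError("initial threshold must be non-negative")
        self.fetch_thres = fetch_thres_0
        self.prev_benefit = None   # benefit of the previous adjusted period
        self.direction = 0         # -1: last step decreased, +1: increased

    def _step(self):
        if self.direction < 0:
            if self.fetch_thres > 0:   # never go below zero
                self.fetch_thres -= 1
        else:
            self.fetch_thres += 1

    def end_of_period(self, benefit_t, dram_usage):
        """Call once per period; returns the threshold for the next period."""
        if dram_usage <= self.USAGE_GATE:
            return self.fetch_thres
        if self.prev_benefit is None:
            # (4-2-1) first adjustment: fetch more eagerly if fetching pays off
            self.direction = -1 if benefit_t >= 0 else +1
        elif benefit_t <= self.prev_benefit:
            # (4-2-2) no improvement: reverse (equality handling is assumed)
            self.direction = -self.direction
        self._step()
        self.prev_benefit = benefit_t   # (4-2-3)
        return self.fetch_thres
```

As a worked example with made-up counts and latencies, 1000 cache reads saving 50 cycles each, 500 cache writes saving 200 cycles each, and 10 fetches costing 4000 cycles each give a gain of 50000 + 100000 − 40000 = 110000 cycles for the period.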

Step 5: Fetch the NVM page corresponding to the virtual address into the DRAM cache, and update the TLB and the extended page table. As shown in FIG. 8, this step mainly includes the following sub-steps:

(5-1) If the DRAM cache is full, use the LRU algorithm to determine which DRAM cache page is to be reclaimed. If the cached page is dirty, look up the extended page table to obtain the physical address of the NVM memory page corresponding to the DRAM page, and write the modified DRAM page back to its corresponding NVM page; then set the P flag of the TLB entry corresponding to the DRAM page to 0, and set the PD flag of the extended page table entry to 0 (indicating that the physical page corresponding to the virtual page is now in the NVM main memory). If the DRAM cache is not full, go to step (5-2).

(5-2) Call the buddy allocator of the DRAM cache management module to allocate a free DRAM page, and assume that the address of the free DRAM page is dram_ppn.

(5-3) Set up the mapping from the NVM page to the DRAM cache page in the extended page table and the TLB. That is, set the overlap field of the extended page table entry to dram_ppn, and set the PD flag of the extended page table entry to 1; and set the overlap tlb field of the TLB entry to dram_ppn, and set the P flag of the TLB entry to 1.

(5-4) Call the memory controller to copy the NVM page to the corresponding DRAM page.

Step 6: Memory access: access the memory according to the address transmitted to the memory controller.
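Sub-steps (5-1) to (5-4) can be sketched as follows. This Python model is illustrative only: the `DramCache` class and `fetch_page` function are hypothetical, a plain free list stands in for the buddy allocator, and page tables and the TLB are dictionaries keyed by virtual page number. The sketch clears the flags of a reclaimed page whether or not it was dirty and resets its counter to 0; both are assumptions, since the text does not spell these cases out.

```python
from collections import OrderedDict


class DramCache:
    def __init__(self, num_pages):
        self.free = list(range(num_pages))  # stand-in for the buddy allocator
        self.lru = OrderedDict()            # dram_ppn -> vpn, least recent first
        self.dirty = set()                  # DRAM pages modified since fetch
        self.data = {}                      # dram_ppn -> page content


def fetch_page(vpn, cache, page_table, tlb, nvm):
    """Cache the NVM page behind vpn in DRAM and return its DRAM page number."""
    # (5-1) reclaim the least recently used DRAM page if the cache is full
    if not cache.free:
        victim_ppn, victim_vpn = cache.lru.popitem(last=False)
        victim_pte = page_table[victim_vpn]
        if victim_ppn in cache.dirty:
            nvm[victim_pte["ppn"]] = cache.data[victim_ppn]   # write back
            cache.dirty.discard(victim_ppn)
        victim_pte["pd"], victim_pte["overlap"] = 0, 0
        if victim_vpn in tlb:
            tlb[victim_vpn]["p"], tlb[victim_vpn]["overlap_tlb"] = 0, 0
        cache.free.append(victim_ppn)
    # (5-2) allocate a free DRAM page
    dram_ppn = cache.free.pop()
    # (5-3) set up the NVM-to-DRAM mapping in the page table and the TLB
    pte = page_table[vpn]
    pte["pd"], pte["overlap"] = 1, dram_ppn
    tlb[vpn] = {"ppn": pte["ppn"], "p": 1, "overlap_tlb": dram_ppn}
    # (5-4) copy the page content from NVM to DRAM
    cache.data[dram_ppn] = nvm[pte["ppn"]]
    cache.lru[dram_ppn] = vpn
    return dram_ppn
```

With a one-page cache, fetching a second page forces the first to be reclaimed; if the first page was marked dirty, its modified content is written back to its NVM page before the DRAM page is reused.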
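The decision made on a last-level cache miss (steps 3 to 6) can be summarised in a short Python sketch. The function name `on_llc_miss`, the dictionary-based TLB entry, the `fetch_page` callback, and the 4 KiB page size are assumptions for illustration. The text covers a counter below the threshold and a counter above it; treating a counter equal to the threshold as a fetch is an assumption of this sketch.

```python
PAGE_SHIFT = 12  # assumed 4 KiB pages


def on_llc_miss(entry, vaddr, fetch_thres, fetch_page):
    """Return (medium, physical address) to hand to the memory controller.

    entry is a dict with keys "ppn", "p" and "overlap_tlb";
    fetch_page(entry) performs step 5 and returns the allocated DRAM page.
    """
    offset = vaddr & ((1 << PAGE_SHIFT) - 1)
    if entry["p"] == 1:
        # Step 3, P = 1: DRAM cache hit
        return "DRAM", (entry["overlap_tlb"] << PAGE_SHIFT) | offset
    if entry["overlap_tlb"] < fetch_thres:
        # Step 4: page is not hot enough yet; count the access and use NVM
        entry["overlap_tlb"] += 1
        return "NVM", (entry["ppn"] << PAGE_SHIFT) | offset
    # Step 5: fetch the page into DRAM, then access the DRAM copy (step 6)
    dram_ppn = fetch_page(entry)
    entry["p"], entry["overlap_tlb"] = 1, dram_ppn
    return "DRAM", (dram_ppn << PAGE_SHIFT) | offset
```

With a threshold of 2, the first two misses on a page are served from NVM while the counter rises to 2, the third miss triggers the fetch, and later misses resolve directly to the DRAM page. A threshold of 0 degenerates to demand-based fetching.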

In the present invention, a DRAM/NVM hierarchical heterogeneous memory system with a software-hardware cooperative management scheme is designed. By extending the last-level page table and the TLB, step 1 and step 5 eliminate hardware costs in conventional hardware-managed DRAM/NVM hierarchical memory systems and reduce memory access latency in case of a last-level cache miss. DRAM cache management is pushed to the software level, which improves the flexibility of DRAM/NVM hierarchical heterogeneous memory systems. Considering that applications with poor data locality would cause severe cache pollution, this invention adopts a utility-based data caching algorithm to filter accesses to the DRAM cache, as described in step 3 and step 4. This improves the efficiency of the DRAM cache and the bandwidth usage from the NVM memory to the DRAM cache.

Aspects:

The following numbered aspects provide further disclosure of the invention. It is noted that any of aspects 1-4 below can be combined with any of aspects 5-8.

1. A DRAM/NVM hierarchical heterogeneous memory access method with a software-hardware cooperative management scheme, comprising the following steps:

step 1: address translation in TLB: acquiring the physical page number (ppn), the P flag, and the content of the overlap tlb field from a TLB entry which corresponds to a virtual page number (vpn), and translating the virtual address into an NVM physical address according to the ppn;

step 2: determining whether the memory access hits in an on-chip cache; directly fetching, by a CPU, the requested data block from the on-chip cache if the memory access hits, and ending the memory access process; otherwise, turning to step 3;

step 3: determining the storage medium of the memory access according to the P flag acquired in step 1; if P is 0, which indicates an NVM page access, turning to step 4, updating the overlap tlb field (used as a counter in this case) in the TLB entry, and determining, according to the fetching threshold set by the dynamic threshold adjustment algorithm and the overlap tlb field acquired in step 1, whether to fetch the NVM physical page corresponding to the virtual page into the DRAM cache; otherwise, if P is 1, which indicates a DRAM page access and a cache hit, obtaining the physical address of the to-be-accessed DRAM cache page according to the overlap tlb field acquired in step 1 and the offset of the virtual address, and turning to step 6 to access the DRAM cache;

step 4: if the value of the overlap tlb field acquired in step 1 is less than the fetching threshold, looking up the TLB entry corresponding to the NVM main memory page, increasing the value of its overlap tlb field (used as a counter in this case) by one, and turning to step 6 to directly access the NVM memory, where the fetching threshold is determined by a fetching threshold runtime adjustment algorithm; turning to step 5 to fetch the NVM main memory page into the DRAM if the value of the overlap tlb field acquired in step 1 is greater than the fetching threshold;

step 5: fetching the NVM page corresponding to the virtual address into the DRAM cache, and updating the TLB and the extended page table; and

step 6: memory access: accessing the memory according to the address transmitted to a memory controller.

2. The method according to aspect 1, wherein step 4 comprises the following sub-steps:

(4-1) acquiring, by a memory monitoring module, the number of page fetches n_{fetch}, the number of cache reads n_{dram_read}, and the number of cache writes n_{dram_write} from the memory controller, wherein in the system, the average read and write latencies of the NVM are t_{nvm_read} and t_{nvm_write}, respectively, the average read and write latencies of the DRAM are t_{dram_read} and t_{dram_write}, respectively, and the overhead of caching an NVM page into the DRAM is t_{fetch}; and calculating, by using Formula 4.1, the system performance gain obtained from caching NVM pages in the DRAM in every 10⁹ clocks:

benefit_t = n_{dram_read} × (t_{nvm_read} − t_{dram_read}) + n_{dram_write} × (t_{nvm_write} − t_{dram_write}) − n_{fetch} × t_{fetch}   (Formula 4.1)

(4-2) assuming that the initial fetching threshold is fetch_thres_0 (fetch_thres_0 ≥ 0), that the fetching threshold and performance gain in the previous 10⁹-clock period are fetch_thres_{t−1} and benefit_{t−1}, respectively, that the fetching threshold and the performance gain of the current 10⁹-clock period are fetch_thres_t and benefit_t, respectively, and that the cache utilization is dram_usage, adjusting the fetching threshold by using a hill climbing algorithm if dram_usage > 30%.

3. The method according to aspect 2, wherein adjusting the fetching threshold by using the hill climbing algorithm in step (4-2) comprises the following sub-steps:

4-2-1) If the threshold is being adjusted for the first time, calculate the performance gain under the given fetching threshold. If benefit_(t)≥0, data block fetching can improve system performance, and if fetch_thres₀>0, fetch_thres_(t)=fetch_thres₀−1; otherwise (benefit_(t)<0), data block fetching may decrease system performance, and fetch_thres_(t)=fetch_thres₀+1. If the threshold is not being adjusted for the first time, go to the next step;

4-2-2) If benefit_(t)>benefit_(t−1), the fetching threshold adjustment made in the previous 10⁹-clock period improved the system performance, and the same adjustment is applied again. That is, if the fetching threshold was decreased in the previous 10⁹-clock period and fetch_thres_(t−1)>0, fetch_thres_(t)=fetch_thres_(t−1)−1; and if the fetching threshold was increased in the previous 10⁹-clock period, fetch_thres_(t)=fetch_thres_(t−1)+1. Otherwise (benefit_(t)<benefit_(t−1)), use the adjustment opposite to that of the previous 10⁹-clock period; and

4-2-3) updating benefit_(t−1) to be benefit_(t).
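Sub-steps 4-2-1 to 4-2-3 can be sketched in Python as a small state machine that is stepped once per 10⁹-clock period. The class name `ThresholdTuner` and its fields are hypothetical. Three points are assumptions, because the text does not settle them: the threshold is left unchanged when dram_usage is at or below 30%, an unchanged benefit is treated like a decrease (the direction is reversed), and a decrease that is blocked at zero is still remembered as a decrease.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThresholdTuner:
    fetch_thres: int                      # starts at fetch_thres_0 (>= 0)
    prev_benefit: Optional[float] = None  # benefit of the previous period
    direction: int = 0                    # -1: lowered last period, +1: raised

    def step(self, benefit: float, dram_usage: float) -> int:
        """Return the fetching threshold to use in the next period."""
        if dram_usage <= 0.30:
            # Assumption: below the utilization gate nothing is adjusted.
            return self.fetch_thres
        if self.prev_benefit is None:
            # 4-2-1: first adjustment; lower the threshold if fetching pays off.
            direction = -1 if benefit >= 0 else 1
        elif benefit > self.prev_benefit:
            # 4-2-2: the last move helped, so repeat it.
            direction = self.direction
        else:
            # 4-2-2, otherwise: the last move did not help, so reverse it.
            direction = -self.direction
        # The threshold is only decreased while it is greater than zero.
        self.fetch_thres = max(0, self.fetch_thres + direction)
        self.direction = direction
        self.prev_benefit = benefit       # 4-2-3
        return self.fetch_thres
```

Starting from a threshold of 2, a sequence of period benefits 100, 150, 120, 130 (all with utilization above 30%) would move the threshold to 1, 0, 1, 2 in this sketch.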

4. The method according to aspect 1 or 2, wherein step 5 comprises the following sub-steps:

(5-1) determining, by using an LRU algorithm, the address of a DRAM cache page to be reclaimed if the DRAM cache is full, and looking up the extended page table to obtain an address of an NVM memory page corresponding to the DRAM page if the cache page is written and dirty, writing modified cache page content back to a corresponding NVM main memory, setting a P flag of a TLB entry corresponding to the DRAM page as 0, and setting a PD flag of an entry of the extended page table as 0; or turning to step (5-2) if the DRAM cache is not full;

(5-2) calling the buddy allocator of the DRAM cache management module to allocate a free DRAM page, where the address of the free DRAM page is denoted dram_ppn.

(5-3) setting up the mapping from the NVM page to the DRAM cache page in the extended page table and the TLB. That is, set the overlap field of the extended page table as dram_ppn, and set the PD flag of the extended page table as 1; and set the overlap tlb field of the TLB as dram_ppn, and set the P flag of the TLB as 1.

(5-4) calling the memory controller to copy the NVM page to the corresponding DRAM page.
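Sub-steps (5-1) to (5-4) can be modeled with a short Python sketch. It is a simplified software model, not the described implementation: the class `DramCache`, the dictionary layouts for the TLB and the extended page table, and the free list standing in for the buddy allocator are all assumptions. The sketch also assumes that the P and PD flags of a reclaimed page are cleared whether or not the page is dirty, and that the overlap fields are reset to 0 on reclaim so that the TLB field can count accesses again.

```python
from collections import OrderedDict


class DramCache:
    """Toy model of the DRAM cache used to illustrate step 5."""

    def __init__(self, num_pages, nvm, page_table, tlb):
        self.free = list(range(num_pages))  # stand-in for the buddy allocator
        self.lru = OrderedDict()            # dram_ppn -> vpn, least recent first
        self.dirty = set()                  # DRAM pages modified since the fetch
        self.dram = {}                      # dram_ppn -> page content
        self.nvm, self.page_table, self.tlb = nvm, page_table, tlb

    def touch(self, dram_ppn, write=False):
        """Record an access so that LRU order and dirty state stay current."""
        self.lru.move_to_end(dram_ppn)
        if write:
            self.dirty.add(dram_ppn)

    def _reclaim(self):
        # (5-1) pick the least recently used DRAM page as the victim.
        victim_ppn, victim_vpn = self.lru.popitem(last=False)
        pte = self.page_table[victim_vpn]
        if victim_ppn in self.dirty:
            # Write the modified content back to the NVM page found via the page table.
            self.nvm[pte["ppn"]] = self.dram[victim_ppn]
            self.dirty.discard(victim_ppn)
        if victim_vpn in self.tlb:
            self.tlb[victim_vpn].update(p=0, overlap=0)  # assumption: counter restarts at 0
        pte.update(pd=0, overlap=0)
        self.free.append(victim_ppn)

    def fetch(self, vpn):
        if not self.free:
            self._reclaim()
        dram_ppn = self.free.pop()                    # (5-2) allocate a free DRAM page
        pte = self.page_table[vpn]
        pte.update(pd=1, overlap=dram_ppn)            # (5-3) extended page table
        self.tlb[vpn].update(p=1, overlap=dram_ppn)   # (5-3) TLB
        self.dram[dram_ppn] = self.nvm[pte["ppn"]]    # (5-4) copy NVM page to DRAM
        self.lru[dram_ppn] = vpn
        return dram_ppn
```

In this model, a one-page cache that holds a dirty copy of one NVM page writes that page back before the same DRAM page is reused for the next fetch.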

5. A DRAM/NVM hierarchical heterogeneous memory system with a software-hardware cooperative management scheme, comprising a modified TLB layout, an extended page table, and a utility-based data fetching module. The significant features include:

the modified TLB is used to cache both virtual-to-NVM and virtual-to-DRAM address mappings. This mechanism improves the efficiency of address translation. In addition, some reserved bits in the TLB are further used to record the page access frequency of applications, so as to assist the utility-based data fetching module in decision making;

the extended page table stores all mappings from virtual pages to physical pages and mappings from NVM pages to DRAM cache pages, where a PD flag is used to indicate the type of a page frame recorded in a page table entry; and

the utility-based data fetching module is used to replace the demand-based data fetching policy in a conventional hardware-managed cache. The module mainly includes three sub-modules: a memory monitoring module, a fetching threshold runtime adjustment module, and a data prefetcher: (1) the memory monitoring module is used to acquire DRAM cache utilization and NVM page access frequency from the modified TLB and the memory controller, and use them as input information for the fetching threshold runtime adjustment module; (2) the fetching threshold runtime adjustment module is used to dynamically adjust the data fetching threshold according to the runtime information provided by the memory monitoring module, to improve DRAM cache utilization and bandwidth usage between the NVM and the DRAM; (3) the data prefetcher is used to: ① trigger a buddy allocator of the DRAM cache management module to allocate a DRAM page for caching an NVM memory page; ② copy content of the NVM memory page to the DRAM page allocated by the buddy allocator; and ③ update the extended page table and the TLB.

6. The DRAM/NVM hierarchical heterogeneous memory system with software-hardware cooperative management according to aspect 5, wherein the data structures of the last-level page table entry and the TLB entry are modified, and the mapping from virtual pages to physical pages and the mapping from NVM memory pages to DRAM cache pages are uniformly managed, which improves the access speed while ensuring the correctness of cache access and reclaiming, and eliminates the hardware overheads of a conventional hardware-managed DRAM/NVM hierarchical memory system.

7. The DRAM/NVM hierarchical heterogeneous memory system with software-hardware cooperative management according to aspect 5 or 6, wherein an overlap tlb field is designed by using some reserved bits of the TLB entry, and this field is fully utilized to monitor the page access frequency or record the DRAM page number corresponding to an NVM page.

8. The DRAM/NVM hierarchical heterogeneous memory system with software-hardware cooperative management according to aspect 5 or 6, wherein the monitoring module of the utility-based data fetching module acquires the information of page access frequency and cache utilization from the TLB and the memory controller; and the fetching threshold is dynamically adjusted by using a fetching threshold runtime adjustment algorithm, to improve cache efficiency and bandwidth usage between the NVM memory and the DRAM cache.

Practitioners in this field can easily understand this invention. The descriptions above are only preferred embodiments of the present invention, and are not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the rationale and principle of the present invention shall fall within the protection scope of the present invention.

The invention claimed is:
 1. A Dynamic Random Access Memory/Non-Volatile Memory (DRAM/NVM) hierarchical heterogeneous memory access system with software-hardware cooperative management configured to perform the following steps: step 1: Translation Lookaside Buffer (TLB) address translation: acquiring a physical page number (ppn), a P flag, and content of an overlap tlb field of an entry where a virtual page is located, and translating a virtual address into an NVM physical address according to the ppn; step 2: determining whether memory access is hit in an on-chip cache; directly fetching, by a Central Processing Unit (CPU), a requested data block from the on-chip cache if the memory access is hit, and ending a memory access process; otherwise, turning to step 3; step 3: determining a memory access type according to the P flag acquired in step 1; if P is 0, which indicates access to an NVM, turning to step 4, updating information of the overlap tlb field in a TLB table, and determining, according to an automatically adjusted prefetching threshold in a dynamic threshold adjustment algorithm and the overlap tlb field acquired in step 1, whether to prefetch an NVM physical page corresponding to the virtual page into a DRAM cache; or if P is 1, which indicates access to a DRAM cache and indicates that the DRAM cache is hit, calculating an address of the DRAM cache to be accessed according to the information of the overlap tlb field acquired in step 1 and a physical address offset, and turning to step 6 to access the DRAM cache; step 4: looking up a TLB entry corresponding to an NVM main memory page if a value of the overlap tlb field acquired in step 1 is less than the automatically adjusted prefetching threshold, and increasing the overlap tlb field of the TLB by one, wherein the overlap tlb field is a counter; turning to step 5 if the value of the overlap tlb field acquired in step 1 is greater than the automatically adjusted prefetching threshold, to prefetch the NVM main memory page into the DRAM cache; otherwise, turning to step 6 to directly access the NVM, wherein the automatically adjusted prefetching threshold is determined by a prefetching threshold runtime adjustment algorithm; step 5: prefetching the NVM physical page corresponding to the virtual address into the DRAM cache, and updating the TLB and an extended page table; and step 6: memory access: accessing the memory according to an address transmitted into a memory controller.
 2. The system according to claim 1, wherein step 4 comprises the following sub-steps: (4-1) acquiring, by a memory monitoring module, a number of prefetching times n_(fetch), a number of cache read times n_(dram_read), and a number of cache write times n_(dram_write) from the memory controller, wherein in the system, assuming average read and write delays of the NVM are t_(nvm_read) and t_(nvm_write), respectively, average read and write delays of the DRAM cache are t_(dram_read) and t_(dram_write), respectively, and an overhead of fetching DRAM page is t_(fetch), and calculating, every 10⁹ clocks by using: $\text{benefit} = n_{\text{dram\_read}} \times (t_{\text{nvm\_read}} - t_{\text{dram\_read}}) + n_{\text{dram\_write}} \times (t_{\text{nvm\_write}} - t_{\text{dram\_write}}) - n_{\text{fetch}} \times t_{\text{fetch}}$, a performance benefit of the system brought about by memory page prefetching; (4-2) assuming that an initial prefetching threshold is fetch_thres₀ (fetch_thres₀≥0), the prefetching threshold and a performance benefit of a previous 10⁹-clock period are fetch_thres_(t−1) and benefit_(t−1) respectively, the prefetching threshold and a performance benefit of a current 10⁹-clock period are fetch_thres_(t) and benefit_(t), respectively, and a cache usage is dram_usage, adjusting the prefetching threshold by using a hill climbing algorithm if dram_usage>30%.
 3. The system according to claim 2, wherein the adjusting the prefetching threshold by using a hill climbing algorithm in step (4-2) comprises the following sub-steps: 4-2-1) if the prefetching threshold is adjusted for a first time: if benefit_(t)≥0, it indicates that data block prefetching can improve system performance, and when fetch_thres₀>0, fetch_thres_(t)=fetch_thres₀−1; if benefit_(t)<0, it indicates that data block prefetching will decrease the system performance, and fetch_thres_(t)=fetch_thres₀+1; otherwise, turning to the next step; 4-2-2) if benefit_(t)>benefit_(t−1), it indicates that a prefetching threshold adjusting method used in the previous 10⁹-clock period helps improve the system performance, and a threshold adjusting method remains the same as the threshold adjusting method in the previous 10⁹-clock period, that is, if the prefetching threshold is decreased in the previous 10⁹-clock period and fetch_thres_(t−1)>0, fetch_thres_(t)=fetch_thres_(t−1)−1; and if the prefetching threshold is increased in the previous 10⁹-clock period, fetch_thres_(t)=fetch_thres_(t−1)+1; otherwise, using a threshold adjusting method opposite to that of the previous 10⁹-clock period; and 4-2-3) updating benefit_(t−1) to be benefit_(t).
 4. The system according to claim 1, wherein step 5 comprises the following sub-steps: (5-1) determining, by using a least recently used (LRU) algorithm, an address of a DRAM cache page to be reclaimed if the DRAM cache is full, and looking up the extended page table to obtain an address of an NVM page corresponding to the DRAM cache page if the DRAM cache page is written or dirty, writing modified cache page content back to a corresponding NVM, setting the P flag of the TLB entry corresponding to the DRAM page as 0, and setting a PD flag of an extended page table entry as 0; or turning to step (5-2) if the DRAM cache is not full; (5-2) calling a buddy allocator in DRAM cache management module to allocate a free DRAM page, and setting an address of the free DRAM page as dram_ppn; (5-3) inserting a mapping from the NVM page to the DRAM cache page into the extended page table and the TLB, that is, setting an overlap field of the extended page table as dram_ppn, and the PD flag of the extended page table as 1; setting the overlap tlb field of the TLB as dram_ppn, and the P flag of the TLB as 1; and (5-4) calling a page copy interface in the memory controller to copy the NVM page into the corresponding DRAM cache page.
 5. A Dynamic Random Access Memory/Non-Volatile Memory (DRAM/NVM) hierarchical heterogeneous memory system with software-hardware cooperative management, comprising a modified Translation Lookaside Buffer (TLB) layout, an extended page table, and a utility-based data prefetching module, wherein: the modified TLB is configured to cache mappings from some virtual pages to NVM pages and DRAM cache pages, to improve an address translation speed; in addition, reserved bits in the modified TLB are further configured to collect information of application access frequency, to assist the utility-based data prefetching module in data prefetching; the extended page table stores all mappings from virtual pages to physical NVM pages and mappings from NVM pages to DRAM cache pages, wherein a PD flag is set, to indicate a type of a page frame recorded in a page table entry; and the utility-based data prefetching module is configured to replace a demand-based data prefetching policy in a conventional cache architecture, wherein the utility-based data prefetching module mainly comprises three sub-modules: a memory monitoring module, a prefetching threshold runtime adjustment module, and a data prefetcher: (1) the memory monitoring module is configured to acquire cache utilization and main memory page access frequency information from the modified TLB and a memory controller, and use the cache utilization and the main memory page access frequency information as input information for runtime adjustment of a data prefetching threshold; (2) the prefetching threshold runtime adjustment module is used to dynamically adjust the data prefetching threshold according to runtime information provided by the monitoring module, to improve usage of a DRAM cache and usage of bandwidth between the NVM and the DRAM; (3) the data prefetcher is used to: ① trigger a buddy allocator of the DRAM cache management module to allocate a DRAM cache page for caching an NVM page; ② copy content of the NVM page to the DRAM cache page allocated by the DRAM cache management module; and ③ update the extended page table and the TLB.
 6. The DRAM/NVM hierarchical heterogeneous memory system with software-hardware cooperative management according to claim 5, wherein data structures of last-level page table entries and TLB entries are modified, and the mappings from virtual pages to physical pages and the mappings from NVM pages to DRAM cache pages are uniformly managed, which improves a memory access speed while ensuring correctness of cache access and reclaiming, and eliminates hardware overheads of a conventional DRAM/NVM hierarchical heterogeneous memory system.
 7. The DRAM/NVM hierarchical heterogeneous memory system with software-hardware cooperative management according to claim 5, wherein an overlap tlb field is formed by using reserved bits of a TLB entry, and the overlap tlb field is fully utilized to monitor the page access frequency information and record a DRAM page number corresponding to an NVM page.
 8. The DRAM/NVM hierarchical heterogeneous memory system with software-hardware cooperative management according to claim 5, wherein the memory monitoring module of the utility-based data prefetching module acquires memory page access frequency information and cache utilization information from the TLB and the memory controller; and the data prefetching threshold is dynamically adjusted by using a prefetching threshold runtime adjustment algorithm, to improve cache usage and usage of bandwidth from the NVM to the DRAM cache.